Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code
نویسندگان
چکیده
Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is two folded: First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings based on word2vec, a neural-network-based technique for producing word embeddings; Second, hierarchically from bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings for their constituents, and then build mappings among code elements across languages based on similarities among embeddings. Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at https://github.com/bdqnghi/hierarchical-programming-language-mapping. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.
منابع مشابه
Hierarchical Reasoning with Distributed Vector Representations
We demonstrate that distributed vector representations are capable of hierarchical reasoning by summing sets of vectors representing hyponyms (subordinate concepts) to yield a vector that resembles the associated hypernym (superordinate concept). These distributed vector representations constitute a potentially neurally plausible model while demonstrating a high level of performance in many dif...
متن کاملTowards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at word, sentence or at least document level. We hypothesize however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analy...
متن کاملWiktionary-Based Word Embeddings
Vectorial representations of words have grown remarkably popular in natural language processing and machine translation. The recent surge in deep learning-inspired methods for producing distributed representations has been widely noted even outside these fields. Existing representations are typically trained on large monolingual corpora using context-based prediction models. In this paper, we p...
متن کاملTitle: Miniature Language Learning via Mapping of Grammatical Structure to Visual Scene Structure in English and Japanese
The current research demonstrates a system that employs relatively simple learning mechanisms to construct mappings between structure in visual scenes, and the grammatical structure of sentences that describe those scenes. These learned mappings allow the system to processes natural language sentences in order to reconstruct complex internal representations of the visual scenes that those sente...
متن کاملEvolving Distributed Representations for Language with Self-Organizing Maps
We present a neural-competitive learning model of language evolution in which several symbol sequences compete to signify a given propositional meaning. Both symbol sequences and propositional meanings are represented by high-dimensional vectors of real numbers. A neural network learns to map between the distributed representations of the symbol sequences and the distributed representations of ...
متن کامل